12/20/2021

Introduction

What to do

Our analysis consists of five parts.

  1. First, we want to describe gender equality inside the company, understand whether there are discrepancies, and see how those problems can be solved.

  2. Second, we want to understand the attrition factor and gain a clear picture of how the structure of the offices differs between Barcelona and London. This would be very helpful in order to decrease the percentage of people who want to leave and to show how the company can improve the level of comfort of its employees.

  3. Third, we think it is very important to describe the attitude of our employees and their satisfaction level. Combining these findings with the previous question, we can improve our recommendations and suggest where the company can adopt new HR methodologies.

  4. Fourth, we want to understand how the company can find its successors and what the best solution could be.

  5. Finally, it is good practice to bring new datasets into the project that could improve the decisions we take.

Description

In this report we assume that the new CEO of a specific IT company has contacted us because she wants us to analyze the current Human Resources status of the company. She has just sent a data set with all available employee information. As we can see, the company has two locations: the first one in London, and the second one in Barcelona.

The new CEO is concerned about several issues. She truly believes in gender equality in organizations, as it sends a signal to society. She is also concerned that the offices in Barcelona do not follow a structure similar to the one in London. In her opinion, the structure of the Barcelona offices should tend towards the London structure. In her meeting with us, she also told us that she would like to know the attitudes (e.g., satisfaction) of the employees across the different departments and whether anything could be done to improve them. Finally, she commented that she is very concerned about the company’s succession strategy, in particular for some positions in certain departments.

Set up the software

The software used for the development of the study and the writing of the report is R[1]. The first step is to define the working directory and to load the libraries needed:
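The loading chunk itself is not shown in this report. A minimal sketch of what it could look like is given below, assuming the packages are those cited in the References section; the exact list used originally is an assumption. Each package is attached only if it is installed, so the sketch runs on any machine.

```r
# Packages cited in the References section; the exact set used by the
# original (hidden) chunk is an assumption.
pkgs <- c("tidyverse", "janitor", "GGally", "gridExtra", "tidytext",
          "glue", "scales", "plotly", "RColorBrewer")

# Attach each package only if it is installed and record which ones loaded.
loaded <- vapply(pkgs, function(p) {
  if (requireNamespace(p, quietly = TRUE)) {
    suppressPackageStartupMessages(library(p, character.only = TRUE))
    TRUE
  } else {
    FALSE
  }
}, logical(1))

# Packages that would still need install.packages()
names(loaded)[!loaded]
```

The working directory can be set with setwd() before this step; the path depends on the local machine and is therefore left out.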

Importing Data

source("functions_script.R")

The first step is to load the dataset into the system and check the names of the variables.

mydb <- read.csv2("dataset.csv")
names(mydb)

Cleaning Data

Names

In this sub-point we change the names of the variables so that all of them follow the same layout.

mydb <- mydb %>% clean_names(., "snake")

Dimensions

First of all, we check the current dimensions of our dataset. From the following code we can see that there are 36 variables and 1506 observations in total.

mydb %>% dim()
## [1] 1506   36
mydb %>% nrow()
## [1] 1506
mydb %>% ncol()
## [1] 36

Head and Tail

Here, we check the first and the last 10 rows of the dataset. Next, we check the top and the bottom values of the most relevant variables to catch possible errors.

mydb %>% head(10)
mydb %>% tail(10)
mydb <- rename(mydb, age = age)
# Ten oldest and ten youngest employees, to spot implausible ages
mydb %>% arrange(desc(age)) %>% top_n(10, age)
mydb %>% arrange(age) %>% top_n(-10, age)

Removing

In this part of the data cleaning we remove all the blank rows, the duplicates and the implausible values that may affect our analysis.

# Remove blank rows and columns
mydb <- mydb %>% remove_empty(c("rows", "cols"))

# Keep only entries with a plausible age (16 to 80)
mydb <- mydb %>% filter(age <= 80 & age >= 16)
# Entries with job_involvement above 4 are treated as invalid
mydb <- mydb %>% filter(job_involvement <= 4)
# The number of previous companies cannot be negative
mydb <- mydb %>% filter(num_companies_worked >= 0)

As we can see, the remove_empty() call did not affect our dataset, which means that there are no empty rows or columns.
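One way to verify such a claim is to compare the dimensions before and after the removal. The sketch below uses a toy data frame with hypothetical values and a base R helper, drop_empty(), that only approximates what remove_empty() does; it is an illustration, not the code used in this report.

```r
# Toy data frame (hypothetical values) with one empty row and one empty column.
toy <- data.frame(
  age    = c(34, NA, 51),
  gender = c("Male", NA, "Female"),
  unused = c(NA, NA, NA)
)

# Hypothetical helper: keep rows and columns that contain at least one
# non-missing value (a base R approximation of removing empty rows/columns).
drop_empty <- function(df) {
  keep_rows <- rowSums(!is.na(df)) > 0
  keep_cols <- colSums(!is.na(df)) > 0
  df[keep_rows, keep_cols, drop = FALSE]
}

cleaned <- drop_empty(toy)

# If the two results are identical, nothing was removed.
dim(toy)
dim(cleaned)
```

On the toy data the dimensions differ, because one row and one column are empty; on a dataset without empty rows or columns they would be identical.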

Next, we move on to the study of duplicates, using the employee_number variable, which we assume to be the key.

# Duplicates removal
mydb %>% get_dupes(employee_number)
mydb <- mydb %>% distinct(employee_number, .keep_all= TRUE)
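The logic of the two calls above can be sketched in base R on a toy data frame with hypothetical values: first list the rows that share a key, then keep only the first row per key.

```r
# Toy data (hypothetical values): employee_number 2 appears twice.
toy <- data.frame(
  employee_number = c(1, 2, 2, 3),
  age             = c(34, 45, 45, 29)
)

# Rows whose key appears more than once (the rows a duplicate check would list)
dupe_keys <- toy$employee_number[duplicated(toy$employee_number)]
toy[toy$employee_number %in% dupe_keys, ]

# Keep the first row for each key, dropping later repetitions
deduped <- toy[!duplicated(toy$employee_number), ]

# Number of rows removed as duplicates
n_removed <- nrow(toy) - nrow(deduped)
n_removed
```

Reporting the number of removed rows makes it easy to document how much the deduplication changed the dataset.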

Checking n’s

Now it is time to check the n’s, that is, the number of observations left after the removals.
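No code is shown for this step. A minimal sketch of how the counts could be checked is given below, on a toy data frame with hypothetical values; the column names city and gender are assumed to match the cleaned dataset.

```r
# Toy data (hypothetical values); column names are assumptions.
toy <- data.frame(
  city   = c("London", "London", "Barcelona", "London", "Barcelona"),
  gender = c("Male", "Female", "Female", "Male", "Male")
)

# Total number of observations
n_total <- nrow(toy)

# Number of observations per city and gender
n_by_group <- table(toy$city, toy$gender)

n_total
n_by_group
```

Very small cells in such a table are a warning that comparisons for that subgroup rest on few observations.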

Dropping NA

To conclude the cleaning of the dataset, we remove every row with at least one missing value.

mydb <- mydb %>% drop_na()
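Before dropping rows it can be useful to see how many values are missing per column and how many rows would be lost. A base R sketch on toy data with hypothetical values:

```r
# Toy data (hypothetical values) with two incomplete rows.
toy <- data.frame(
  age            = c(34, NA, 51, 28),
  monthly_income = c(2900, 4100, NA, 5200)
)

# Missing values per column
n_missing <- colSums(is.na(toy))

# Keep only the rows without any missing value
complete <- toy[complete.cases(toy), ]

# Number of rows dropped
n_dropped <- nrow(toy) - nrow(complete)

n_missing
n_dropped
```

If a single column accounts for most of the missing values, dropping that column may cost less information than dropping the rows.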

From now on we can easily proceed with our analysis.

Analysis

Introduction Analysis

Plot job role frequency

Gender Analysis

First of all, we want to understand how salary differs by sex. This gives us a better overview of how salary is distributed inside the company. Moreover, we also take into account the possible differences between the Barcelona and London headquarters.

Monthly Income

It is quite clear that monthly income is one of the most important variables for determining differences inside a company from a gender perspective.

Introduction

summary(mydb$monthly_income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1009    2909    4930    6510    8386   19999

From this graph we can see that most employees have a salary lower than the average, which is consistent with the median (4930) lying below the mean (6510). In addition, the dotted male line is almost always above the female line, but this can be considered acceptable, because our dataset contains more men than women. Finally, the major discrepancies occur in the lower salary range: the higher the salary, the smaller the differences between female and male monthly income.

Now we take a closer look at the monthly income of men and women. We can observe that men fare slightly worse: women have, on average, a higher salary than men. However, this difference is so small that we need to go into further detail to understand whether any strategy to balance gender inside the company is needed.
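The comparison described here can be sketched as follows, on toy data with hypothetical values. The Welch t-test at the end is our suggestion for judging whether a small gap is distinguishable from noise; it is not part of the original analysis.

```r
# Toy data (hypothetical values).
toy <- data.frame(
  gender         = c("Male", "Male", "Male", "Female", "Female"),
  monthly_income = c(3000, 5000, 7000, 4000, 8000)
)

# Mean and median monthly income per gender
mean_by_gender   <- tapply(toy$monthly_income, toy$gender, mean)
median_by_gender <- tapply(toy$monthly_income, toy$gender, median)

# Gap in means (female minus male)
gap <- mean_by_gender[["Female"]] - mean_by_gender[["Male"]]

mean_by_gender
median_by_gender
gap

# Welch two-sample t-test for the difference in means
t.test(monthly_income ~ gender, data = toy)$p.value
```

With only five toy observations the test has almost no power; on the real dataset the same call would use all employees.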

Since we also need to understand where this difference is located, we filter by location.

This breakdown makes it much clearer that in Barcelona the monthly salary seems to favor women over men, while in London, even though women have a slightly higher average monthly salary, the difference is so small that it can be neglected. In conclusion, we can focus on Barcelona and try to identify the causes of these differences there.

Department

In this graph we can see that the majority of employees work in research and development. Moreover, here too men are more numerous than women in each department.

Here too we found it important to separate Barcelona and London in order to decide where action should ultimately be taken. We can see that the major differences in the number of employees and their distribution are mainly in London, while in Barcelona, for instance, the HR department has almost the same number of employees per gender. This balance is a good sign of gender equality, even if there are some departments, such as Sales and R&D, where men are predominant in both London and Barcelona.

Job Role

The same analysis as above is represented below taking into account the job_role variable.

These two graphs are relevant for some job roles. In fact, as we can see, in Barcelona women play an essential role in the two Director job roles.

Seniority

Taking into consideration both company locations

In these graphs, in order to show the trend more clearly, we decided to remove the confidence interval and the points. As we can see, women generally have a higher salary than men. This holds until approximately the 20th year of seniority in the company, after which male monthly income overtakes female monthly income dramatically.

Taking into consideration London

In the specific case of London we can state that, generally speaking, monthly income is well balanced between male and female positions, but around the 15th year of seniority men get a boost in terms of monthly income.

Taking into consideration Barcelona

In Barcelona we can see that the results are quite different from before. In fact, women generally have a higher income than men.

Other interesting graphs we can draw are those related to age.

Education field

Performance rating education field.

In the first graph we can see that there are some differences. We therefore decided to locate them and found that the major problems are in Barcelona. In fact, among employees with a Human Resources education, women have a higher salary than men, and for the Technical Degree field women also seem to have a clear advantage.

Of course, in some education fields men also seem to have a slightly higher salary, but the main focus is on the fields where the discrepancy is very high, namely Human Resources and Technical Degree, both in Barcelona.

Moreover, we also want to understand whether, for the same job role, gender plays a crucial role in determining the monthly income of each employee.

We can see that in this case too the majority of the gender-related problems are located in Barcelona.

Regression

To conclude, we also wanted to see whether gender is a relevant variable affecting monthly_income or other variables relevant to our study.

mod1 <- lm(monthly_income ~ ., data = mydb)
summary(mod1)$r.squared
## [1] 0.9445084
mod2 <- lm(monthly_income ~ gender, data = mydb)
summary(mod2)$r.squared
## [1] 0.0008905776
mod3 <- lm(monthly_income ~ age + gender + business_travel + education_field + education + distance_from_home + job_level + job_role + marital_status + total_working_years + performance_rating + years_at_company, data = mydb)
summary(mod3)$r.squared
## [1] 0.9434651
rm(mod1, mod2, mod3)

After these three linear regression models we understood that gender is not a significant determinant of monthly income: on its own it explains less than 0.1% of the variance (R-squared of about 0.0009), while the reduced model mod3 (R-squared of about 0.943) fits almost as well as the full model mod1 (R-squared of about 0.945). This does not mean that there are no discrepancies between genders, but simply that an internal company analysis is needed to understand why, in Barcelona, for some roles women or men have higher salaries than the opposite sex.
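R-squared alone does not say whether a single predictor matters once the others are controlled for. A complementary check, sketched below on simulated data (all values and the data-generating process are hypothetical), is to look at the coefficient and p-value of gender and to compare nested models with an F-test.

```r
set.seed(1)
n <- 200

# Simulated data: income depends on job level only, not on gender.
sim <- data.frame(
  gender    = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  job_level = sample(1:5, n, replace = TRUE)
)
sim$monthly_income <- 1000 + 3000 * sim$job_level + rnorm(n, sd = 500)

# Nested models: without and with gender
fit_small <- lm(monthly_income ~ job_level, data = sim)
fit_full  <- lm(monthly_income ~ job_level + gender, data = sim)

# Coefficient table: the row "genderMale" holds the estimate and p-value
coef(summary(fit_full))

# F-test of whether adding gender improves the fit
anova(fit_small, fit_full)

# R-squared of both models, to see how little gender adds
r2 <- c(small = summary(fit_small)$r.squared,
        full  = summary(fit_full)$r.squared)
r2
```

Applied to the real data, the same comparison could be made between mod3 and a version of mod3 without gender.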

Attrition Analysis

First of all, we want to understand how the percentage of attrition is distributed inside the company. Then, we differentiate by location.

Fortunately, as we can see, the percentage of attrition in Barcelona and London is almost the same, so we can proceed with a generalized analysis.
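The percentages compared here are shares within each city, not raw counts. A minimal base R sketch on toy data with hypothetical values (the column name city is an assumption):

```r
# Toy data (hypothetical values).
toy <- data.frame(
  city      = c("London", "London", "London", "London", "Barcelona", "Barcelona"),
  attrition = c("Yes", "No", "No", "No", "Yes", "No")
)

# Counts of leavers and stayers per city
counts <- table(toy$city, toy$attrition)

# Row-wise proportions: share of "Yes" and "No" within each city
rates <- prop.table(counts, margin = 1)

# Attrition rate per city, in percent
round(100 * rates[, "Yes"], 1)
```

The same row-wise proportions are what make groups of different size comparable, for example men and women in the next subsection.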

Gender vs Attrition

As we can see, men tend to leave the company more than women do, but we also have to take into account that there are more men in the company. Hence, the relation seems to be fairly balanced.

Marital status vs Attrition

Here, we can see that single people tend to leave the company more than married and divorced people.

The business_travel variable could be very important, since it tells us whether travel affects attrition. In the end, the category that most affects attrition is travel_rarely, which could mean that the break in routine and the change of plans for the individual employee cause higher attrition. Hence, we suggest clarifying further in advance whether an employee has to travel, and then collecting the new results and running a further analysis.

The department variable could also be very important, since it tells us whether the department affects attrition.

Here, instead, we can see that the R&D department seems to be the one with the highest attrition. However, we also have to take into consideration that the majority of the people in the company work in this department.

Now, we also want to study job_role versus attrition. This tells us which job positions are most affected and where we can work to reduce the percentage of people who leave the company. In this case, Laboratory Technicians, Sales Executives, Research Scientists and Sales Representatives tend to leave the organisation more than others.

Another interesting graph that we can draw is the proportion of education field versus attrition.

Hence, employees from the Life Sciences and Medical fields leave the company more frequently than the others.

Another variable that can affect attrition is the dummy variable indicating whether employees work overtime. We can conclude that employees who work overtime tend to leave the company more than those who do not. This variable is extremely significant because, as we can see, the majority of the workers do not work overtime.

Regression

In order to model attrition we have to convert the Yes/No variable into a binary (dummy) variable. Once that is done, we can work with a logistic regression model and try to understand which variables affect attrition.

# Encode attrition as 1 (Yes) / 0 (No)
mydb$attrition <- ifelse(mydb$attrition == "Yes", 1, 0)

# family = binomial is needed for a logistic regression (glm() defaults to gaussian).
# A glm summary has no r.squared element, so we inspect the coefficient table instead.
mod1 <- glm(attrition ~ ., data = mydb, family = binomial)
summary(mod1)$coefficients
mod2 <- glm(attrition ~ age + business_travel + distance_from_home + environment_satisfaction + job_involvement + job_role + job_satisfaction + num_companies_worked + over_time + relationship_satisfaction + training_times_last_year + work_life_balance + years_since_last_promotion, data = mydb, family = binomial)
summary(mod2)$coefficients
rm(mod1, mod2)

To sum up, the first regression shows all the variables and their coefficients, while in the second model we kept only those variables with a low p-value in mod1.

In conclusion, mod2 contains the variables that have a significant effect on attrition.
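To illustrate how such a model can be read, the sketch below fits a logistic regression on simulated data; the data-generating process and all values are hypothetical. It reports the coefficient table, the odds ratios, and McFadden's pseudo R-squared as a rough analogue of R-squared, since a glm fit has no r.squared element.

```r
set.seed(42)
n <- 500

# Simulated employees: overtime raises and age lowers the odds of leaving.
sim <- data.frame(
  over_time = factor(sample(c("No", "Yes"), n, replace = TRUE, prob = c(0.7, 0.3))),
  age       = round(runif(n, 18, 60))
)
log_odds <- -1 + 1.2 * (sim$over_time == "Yes") - 0.03 * (sim$age - 35)
sim$attrition <- rbinom(n, size = 1, prob = plogis(log_odds))

# Logistic regression: family = binomial gives the logit link
fit <- glm(attrition ~ over_time + age, data = sim, family = binomial)

# Estimates, standard errors and p-values
coef(summary(fit))

# Odds ratios: values above 1 increase the odds of attrition
exp(coef(fit))

# McFadden's pseudo R-squared (valid here because the outcome is 0/1)
pseudo_r2 <- 1 - fit$deviance / fit$null.deviance
pseudo_r2
```

An odds ratio is usually easier to communicate to HR than a raw coefficient: it says how many times the odds of leaving are multiplied when the predictor changes by one unit.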

Satisfaction analysis

In this section we analyze the satisfaction level of the employees and, more generally, of the workplace, to see whether there are differences across the factors and characteristics that we consider. For this reason our study focuses on the following variables and their trends:

1-job_satisfaction

2-environment_satisfaction

1-Job_satisfaction

Gender and Department

To make a first summary of employee satisfaction, we represent its value in each department, differentiated by gender. For this section we use boxplots to present our results, so as to get an idea of the main distribution of the values across the different factors that we consider later.

In this first general overview we can observe a constant trend across departments, partly because the variable “job_satisfaction” takes only a limited set of values, so it is normal to see similar patterns. The distribution for men in the Human Resources business unit is noteworthy: the median and the first quartile overlap, and this group presents the lowest satisfaction values.
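The quartiles that a boxplot draws can also be computed directly, which makes overlaps such as the one described above easy to confirm. A base R sketch on toy data with hypothetical values:

```r
# Toy data (hypothetical values) on a 1 to 4 satisfaction scale.
toy <- data.frame(
  department       = c(rep("Human Resources", 5), rep("Sales", 5)),
  job_satisfaction = c(1, 1, 1, 3, 4, 2, 2, 3, 4, 4)
)

# First quartile, median and third quartile per department.
# Rows of q are the quantiles, columns are the departments.
q <- sapply(
  split(toy$job_satisfaction, toy$department),
  quantile, probs = c(0.25, 0.5, 0.75)
)
q

# TRUE when the median coincides with the first quartile
q["25%", ] == q["50%", ]
```

With only a few distinct values on the scale, coinciding quartiles are common, so they should be read together with the group sizes.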

Differences between the two cities in term of job satisfaction

A possible way to go further in our analysis is to study the differences in the feelings of the workers between London and Barcelona, to understand whether there are parameters or other points worth examining in detail. To this end we compare, for each department, the level of satisfaction per gender.

Sales department, Gender

Starting from the Sales department, we do not observe any peculiarity, as the two graphs are equal; in general we can assume a medium level of satisfaction among the workers in this field, which stands at the value “3”.

R&D department, Gender

For R&D we again observe a regular trend, with an average satisfaction value close to 3. The only relevant note is that the median and the third quartile coincide at the value 3; apart from that, we can consider this a department with a medium satisfaction level.

HR department, Gender

Going deeper into the HR department, we see that the previous graph hides some interesting aspects; this breakdown makes some differences between the two cities clearly evident. For London, the median is located between 2 and 3 for both men and women, with maximum values for men reaching 4, which is also the third quartile. In Barcelona, on the other hand, we see a big discrepancy between genders: men present the lowest values, with a median equal to 2 (a medium-low value) and a first quartile equal to 1. The trend for women is the opposite: the median is equal to 4, which indicates a good level of satisfaction. This may be due to the fact that, as further analysis of the number of employees shows, there is a clear predominance of women in this department in Barcelona, which may not be ideal for the integration of men in this sector.

However, to get an overall view of the graphs presented so far, we summarize them in this picture:

mean_mi <- round(mean(mydb$monthly_income),2)
median_mi <- round(median(mydb$monthly_income),2)

2-Environment_satisfaction

Environment Satisfaction per department

Now we focus on satisfaction with the workplace and examine in more depth whether there are interesting aspects within the individual departments.

Generally there is a medium level of satisfaction for the R&D and Sales departments, while we see a medium-negative attitude in Human Resources. However, this pattern is not very pronounced, which is why we decided to look deeper, by gender and by city.

London, Male

First we filtered by “Male” and “London”; as in the previous analysis, we observe a medium-low trend for men in the Human Resources department. For the other two business units we observe the constant trend that we also found for job_satisfaction.

London, Female

As we expected, the medium-low level in Human Resources for men is balanced among women, so that overall the medium level (= 3) of environment satisfaction is reached, which we also find for the other two departments.

Barcelona, Male

For the city of Barcelona, the dissatisfaction of men in the Human Resources department (in this case related to the environment) is confirmed: even though the third quartile is equal to 4, the median is equal to 2, a value that the company cannot welcome. R&D presents a positive distribution, with the third quartile equal to 4 and the median equal to 3, while the Sales department has the lowest distribution of values, which lie mostly in the range between 2 and 3.

Barcelona, Female

In this last graph we do not see differences in the boxplots of R&D and Sales, which are the same as those for men. However, as we expected, the Human Resources department follows a fairly positive course, going against the trend of the other sex.

This series of graphs thus confirms that the big differences between male and female satisfaction levels occur only in the city of Barcelona.

Regression

We first try to explain the variables “job_satisfaction” and “environment_satisfaction” using the linear regression model:

## [1] 0.04775216
## [1] 0.05169541
## [1] 0.000977927
## [1] 0.0003723011

We can observe that neither “gender” nor “city” is relevant for explaining the two variables studied, “job_satisfaction” and “environment_satisfaction”: the R-squared values reported above are all below 0.06. Nevertheless, it is clear from the previous part that dissatisfaction is more pronounced among male workers in Barcelona, so it would be useful to understand the reasons for this general mood and the difference between them and the London workers in their workplace. A deeper analysis is needed of the behavior of the workers in this company, the relationship between men and women and, above all, their growth prospects within the company: in our opinion this could be the key to solving this problem.

Succession strategy

## [1] 2.687500 2.740622 2.682028

For the succession strategy we decided to consider three indicators that help us understand which department needs a change of leader. The indicators are:

  • the mean of environment satisfaction
  • the mean of the performance rating
  • average of years with current manager

Based on these, we first try to change the leader in the department with the lowest performance rating or environment satisfaction. Once we have chosen the department where we want a change, we evaluate all the attitude and job performance information available for each worker and see whether he or she can cover the designated role. In particular, the variables that we analyze are total working years, years at company, relationship satisfaction, number of companies worked, job role and job level. Job level is the most important, because we favor candidates who already have a high job level.
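The first step of this strategy, computing the three indicators per department and picking the department to prioritize, can be sketched as follows. The data are toy values, the column name years_with_curr_manager is an assumption, and breaking ties by performance rating is our illustrative choice rather than a rule stated in the report.

```r
# Toy data (hypothetical values); column names are assumptions.
toy <- data.frame(
  department               = c("Sales", "Sales", "HR", "HR", "R&D", "R&D"),
  environment_satisfaction = c(3, 2, 2, 2, 4, 3),
  performance_rating       = c(3, 4, 3, 3, 4, 3),
  years_with_curr_manager  = c(2, 5, 7, 9, 1, 3)
)

# Mean of each indicator per department
indicators <- aggregate(
  cbind(environment_satisfaction, performance_rating, years_with_curr_manager) ~ department,
  data = toy, FUN = mean
)
indicators

# Department with the lowest mean environment satisfaction;
# ties are broken by the lowest mean performance rating.
ord <- order(indicators$environment_satisfaction, indicators$performance_rating)
priority <- as.character(indicators$department[ord][1])
priority
```

Once the department is chosen, candidates inside it could be ranked in the same way, for example by job level and then by years at the company.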

Adding dataset

UK Comparison

## Warning: Removed 150 rows containing non-finite values (stat_count).

As we can see from this graph, in the UK employee satisfaction tends to be higher in small companies than in big companies.

Career satisfaction

Barcelona Comparison

Introduction Analysis

summary(barcelona_data)
##       year          sex                age            monthly_income 
##  Min.   :2021   Length:2266        Length:2266        Min.   :  250  
##  1st Qu.:2021   Class :character   Class :character   1st Qu.: 1001  
##  Median :2021   Mode  :character   Mode  :character   Median : 2250  
##  Mean   :2021                                         Mean   : 2686  
##  3rd Qu.:2021                                         3rd Qu.: 4000  
##  Max.   :2021                                         Max.   :10000
tabyl(barcelona_data$sex)
##  barcelona_data$sex    n   percent
##              Female 1122 0.4951456
##                Male 1144 0.5048544

From this data we can see that the overall mean of monthly_income is €2686. Moreover, we can see that the dataset is evenly split between Male (50.5%) and Female (49.5%).

With the cleaned new dataset, we can run some analyses to get a general overview and compare the results from the starting dataset with the new one.
In this case, we can see that in Barcelona men generally have a higher salary than women.

To conclude, we can see here that people between 65 and 74 years old have a higher monthly_income than the other age groups.

Conclusions

In the analysis we reached several conclusions; let us review them.

We started by analyzing the frequency of the different roles in the dataset. Looking at the whole data set without differentiating by city, the most frequent roles are:

  • Sales Executive
  • Research Scientist
  • Laboratory Technician

The next step was to see whether this pattern also holds in the two cities. The second graph shows us that the distribution of roles is the same in both cities.

Moving on, we shifted to the gender analysis to understand whether both offices reach the goal of gender equality.

We first calculated some summary statistics of the distribution of monthly income, obtaining an average monthly income of 6510 euros.

We plotted a graph that shows the general situation of the whole company, where we can see that on average the monthly income of female employees tends to be higher: the average monthly income is 6395.65 euros for men and 6682.85 euros for women.

The next step was to analyze the same point after distinguishing between the two cities.

In both cities the monthly income of women is higher than that of men, but the gap in Barcelona is larger.

Let us move to the results that also take into account the department to which the employee belongs.

In general, the number of men per department is always higher than the number of women, and the majority of personnel work in Research & Development.

As before, we also looked at the two cities separately. The gap between genders per department is very high in London, whereas in Barcelona the HR department respects gender equality, so our CEO needs to work first on the Sales and Research & Development departments.

Let us move now to the job role. In the graph that does not differentiate by city, every role has more men than women except the Manufacturing Director role, where the numbers of men and women are almost the same. The same conclusion holds if we analyze only London. In Barcelona there is a slight difference: women outnumber men in three roles:

  • Manager
  • Research Director
  • Manufacturing Director

Now we write down the results obtained when relating monthly income to years at the company, differentiating by gender and then also by city.

The overall picture shows us that in the initial stage women tend to earn more than men; in the middle stage, from 8 to 20 years, the income is almost the same; and in the final stage, after 20 years of seniority, the gap widens in favor of men.

Looking at London, we can state the same, except that the duration of the middle stage changes, because men start earning more after 15 years of seniority. In Barcelona the picture is different: female employees almost always tend to earn more.

Let us consider the monthly income across the different education fields, also differentiating by gender.

In the graph where we consider both cities at the same time, we can see that income differences are present in the Technical Degree and Human Resources education fields, where the mean female income is higher than the male one. Considering the two cities separately, the problem is still present, but it is bigger in Barcelona.

We carried out the same type of analysis focusing on job role instead of education field. We found that in general income follows the same pattern for both men and women, but in two roles, Research Director and Manager, men tend to earn more than women. In Barcelona the discrepancy goes both ways: in some roles men earn more on average than women, and in others the opposite happens. In London, only the Research Director role shows a big gap in favor of men.

As for the Attrition variable, we assumed at first that it depends on many factors; some of them, in particular gender and city, turned out not to be relevant. Other factors, such as marital_status, were found to be relevant, but we cannot modify them in any way, as they depend on the personal life of the workers. Most importantly, we recognize the significance of some variables, such as business_travel and department, that could be improved by paying more attention to some key points:

  • business_travel: involve workers in the travel planning to ensure that travel does not become a negative factor for them.
  • department: pay more attention to the critical departments with the highest numbers of people who want to leave the company, in this case R&D.

On the other hand, looking at satisfaction (job_satisfaction, environment_satisfaction) in our company, we noticed some important differences by gender in the city of Barcelona, as there was a substantial gap between the perceptions of men and women. Through the linear model we have seen that, in general, gender and city are not significant in explaining the phenomenon itself. However, to address the issue found in the satisfaction of Barcelona’s workers, it could be relevant to analyze and possibly widen the growth prospects for employees in the company, to encourage them to appreciate their work more and to keep the workplace united in pursuing the company’s objectives.

To understand whether our analysis is based on reliable data, we took into consideration two new datasets, one related to Barcelona and the other related to the UK.

In the UK dataset, we compared how the size of the company influences the level of job satisfaction. In particular, we saw that the level of job satisfaction increases considerably when the company is of medium size (20 to 90 employees).

We carried out the same analysis for career satisfaction, and the result agrees with the one stated above: career satisfaction is higher in companies of medium size.

We decided to carry out this type of analysis mainly because we saw that the dataset has a predominance of observations from London, which can be explained by the fact that the London office is bigger than the Barcelona office.

We decided to consider the Barcelona dataset to shed light on the results obtained from the gender analysis, to see whether they reflect the situation of Barcelona in general or only that of the company we are analyzing. We mainly see that in Barcelona the average monthly income is almost the same for men and women. So the CEO should consider running further analysis to understand why there is so much difference between male and female monthly income in the company.

References

[1] R Core Team (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

[2] Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686

[3] H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.

[4] Barret Schloerke, Di Cook, Joseph Larmarange, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg and Jason Crowley (2021). GGally: Extension to ‘ggplot2’. R package version 2.1.2. https://CRAN.R-project.org/package=GGally

[5] Baptiste Auguie (2017). gridExtra: Miscellaneous Functions for “Grid” Graphics. R package version 2.3. https://CRAN.R-project.org/package=gridExtra

[6] Sam Firke (2021). janitor: Simple Tools for Examining and Cleaning Dirty Data. R package version 2.1.0. https://CRAN.R-project.org/package=janitor

[7] Silge J, Robinson D (2016). tidytext: Text Mining and Analysis Using Tidy Data Principles in R. JOSS, 1(3). https://doi.org/10.21105/joss.00037

[8] Jim Hester (2021). glue: Interpreted String Literals. R package version 1.5.0. https://CRAN.R-project.org/package=glue

[9] Hadley Wickham and Dana Seidel (2020). scales: Scale Functions for Visualization. R package version 1.1.1. https://CRAN.R-project.org/package=scales

[10] C. Sievert. Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC Florida, 2020.

[11] Erich Neuwirth (2014). RColorBrewer: ColorBrewer Palettes. R package version 1.1-2. https://CRAN.R-project.org/package=RColorBrewer